Scale-Invariant and Temporally Consistent Monocular Video Geometry Estimation

MVGE generates temporally consistent and scale-invariant 3D geometry from monocular videos with superior accuracy across extended sequences.



Abstract

We present MVGE, a novel approach for estimating 3D geometry from extended monocular video sequences, where existing methods struggle to maintain both geometric accuracy and temporal consistency across hundreds of frames. Our approach generates affine-invariant 3D point maps with shared parameters across entire sequences, enabling consistent scale-invariant representations. We introduce three key innovations: viewpoint-invariant geometry aligning multi-perspective points in a unified reference frame; appearance-invariant learning enforcing consistency across exponential timescales; and frequency-modulated positioning enabling extrapolation to sequences vastly exceeding training length. Experiments across diverse datasets demonstrate significant improvements, reducing relative point map error by 24.2% and temporal alignment error by 34.9% on ScanNet compared to state-of-the-art methods. Our approach handles challenging scenarios with complex camera trajectories and lighting variations while efficiently processing extended sequences in a single pass. Code will be publicly released, and we encourage readers to explore the interactive demonstrations in our supplementary materials.
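Affine-invariant point maps like those in the abstract are conventionally evaluated after a global least-squares scale-and-shift alignment to the reference geometry. A minimal sketch of that alignment step (the function name `align_affine` and the synthetic example are illustrative, not MVGE's released code):

```python
import numpy as np

def align_affine(pred, ref):
    """Least-squares scale s and shift t minimizing ||s * pred + t - ref||^2.

    pred, ref: (N, 3) arrays of corresponding 3D points.
    Centering removes the shift, leaving a 1-D least-squares fit for s.
    """
    p_mean, r_mean = pred.mean(axis=0), ref.mean(axis=0)
    p_c, r_c = pred - p_mean, ref - r_mean
    s = (p_c * r_c).sum() / (p_c ** 2).sum()
    t = r_mean - s * p_mean
    return s, t

# Example: recover a known scale/shift from synthetic correspondences.
rng = np.random.default_rng(0)
ref = rng.standard_normal((100, 3))
pred = (ref - np.array([0.1, 0.2, 0.3])) / 2.0   # so ref = 2 * pred + shift
s, t = align_affine(pred, ref)
aligned = s * pred + t                            # matches ref up to float rounding
```

Because predictions are only compared up to this transform, the network is free to output geometry at an arbitrary scale and origin, which is what makes a single shared parameterization across an entire sequence feasible.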

Framework

Overview of MVGE. Top-Left: MVGE consists of a ViT backbone that processes video input frames, followed by a temporal decoder with cross-attention and dynamic NTK-scaled RoPE, producing scale-invariant point maps. Top-Right: Cross-frame geometric consistency is enforced at global and local geometric levels (G1, G2) to maintain structural coherence across frames. Bottom-Left: RoPE with dynamic NTK scaling extends the sequence context, using frequency scaling that adaptively weights dimensions by the scale factor, and train-time sequence stretching that samples positions from a virtual extended sequence. Bottom-Right: Hierarchical temporal consistency constraints applied at multiple temporal strides (δ = 1, 2, 4, 8) enforce smooth, consistent point map predictions across time.
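The NTK-scaled RoPE in the bottom-left panel can be sketched with the generic NTK-aware frequency rescaling used for context extension: the rotary base is multiplied by `scale ** (dim / (dim - 2))`, which stretches low-frequency (long-range) dimensions by the full scale factor while leaving the highest frequency untouched. This is a standard recipe, not MVGE's exact implementation, and the function names are illustrative:

```python
import numpy as np

def ntk_scaled_freqs(dim, base=10000.0, scale=1.0):
    """RoPE inverse frequencies with NTK-aware base rescaling.

    Raising the base by scale**(dim/(dim-2)) slows the lowest frequency by
    exactly `scale` while keeping the highest frequency fixed, so positions
    beyond the training length are encoded without aliasing the fine scales.
    """
    base = base * scale ** (dim / (dim - 2))
    return base ** (-np.arange(0, dim, 2) / dim)

def rope(x, pos, inv_freq):
    """Apply a rotary position embedding to a (dim,) vector at integer pos."""
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin   # rotate each 2-D pair by its angle
    out[1::2] = x1 * sin + x2 * cos
    return out
```

With `dim = 64` and `scale = 4`, the slowest frequency is divided by exactly 4 while the fastest (index 0) stays at 1.0; that per-dimension weighting is the adaptive behavior the figure refers to. A "dynamic" variant would choose `scale` at inference time from the ratio of the current sequence length to the training length.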
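The exponentially spaced strides in the bottom-right panel can be illustrated with a deliberately simplified stand-in for the hierarchical consistency constraint: an L1 penalty between point maps δ frames apart, averaged over δ ∈ {1, 2, 4, 8}. MVGE's actual loss presumably compares aligned, invariant geometry rather than raw point maps; this sketch only shows the multi-stride structure:

```python
import numpy as np

def temporal_consistency_loss(points, strides=(1, 2, 4, 8)):
    """Mean L1 difference between point maps at multiple temporal strides.

    points: (T, H, W, 3) sequence of per-frame point maps.
    Short strides encourage frame-to-frame smoothness; long strides
    discourage slow drift that per-pair losses alone would miss.
    """
    total, count = 0.0, 0
    for d in strides:
        if d < len(points):
            total += np.abs(points[d:] - points[:-d]).mean()
            count += 1
    return total / max(count, 1)
```

On a static sequence the loss is zero; on a sequence drifting by one unit per frame it averages (1 + 2 + 4 + 8) / 4 = 3.75, showing how the longer strides dominate the penalty on slow drift.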

Comparison with VGGT on Open-World Videos

Comparison with MoGe on Open-World Videos

Portrait Video Processing

4D Scene Reconstruction

Long-Range Temporal Inference